Small and Stable Descriptors of Distributions for Geometric Statistical Problems
Abstract
This thesis explores how to sparsely represent distributions of points for geometric statistical problems. A coreset C is a small summary of a point set P such that if a certain statistic is computed on both P and C, the difference between the two results is guaranteed to be bounded by a parameter ε. Two examples of coresets are ε-samples and ε-kernels. An ε-sample can estimate the density of a point set in any range from a geometric family of ranges (e.g., disks, axis-aligned rectangles). An ε-kernel approximates the width of a point set in all directions. Both coresets have a size that depends only on the error parameter ε, not on the size of the original data set. We demonstrate several improvements to these coresets and show how they are useful for geometric statistical problems. We reduce the size of ε-samples for density queries in axis-aligned rectangles to nearly the square root of the size required when the queries are with respect to more general families of shapes, such as disks. We also show how to construct ε-samples of probability distributions. We show how to maintain “stable” ε-kernels: if the point set P changes by a small amount, then the ε-kernel also changes by a small amount. This is useful in surveillance and tracking problems, and the stability property leads to more efficient algorithms for maintaining ε-kernels. We next study the case where the input point sets are uncertain and their uncertainty is modeled by probability distributions. Statistics on these point sets (e.g., the radius of the smallest enclosing ball) do not have exact answers, but rather distributions of answers. We describe data structures to represent approximations of these distributions and algorithms to compute them. We also show how to create distributions of ε-kernels and ε-samples for these uncertain data sets. Finally, we examine a spatial anomaly detection problem: computing a spatial scan statistic. The input is a point set P and measurements on the point set. The spatial scan statistic finds the range (e.g., an axis-aligned bounding box) where the measurements inside the range are the most different from the measurements outside the range. We show how to compute this statistic efficiently while allowing for a bounded amount of approximation error. This result generalizes to several statistical models and types of input point sets.
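To make the width guarantee of an ε-kernel concrete, here is a minimal Python sketch. It is not the construction from the thesis: it assumes 2D points, and the helper names (`directional_width`, `extreme_point_summary`, `worst_width_ratio`) and the choice of 16 evenly spaced directions are hypothetical, picked only for illustration. The summary keeps the extreme points of P along a few directions, and the script then measures how much of the directional width of P the summary preserves.

```python
import math
import random


def directional_width(points, angle):
    """Width of a 2D point set along the unit direction at `angle` (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    projections = [x * ux + y * uy for x, y in points]
    return max(projections) - min(projections)


def extreme_point_summary(points, num_directions):
    """Keep the two extreme points along each of `num_directions` evenly
    spaced directions in [0, pi). Illustrative helper (an assumption),
    not the thesis's algorithm."""
    summary = set()
    for i in range(num_directions):
        angle = math.pi * i / num_directions
        ux, uy = math.cos(angle), math.sin(angle)

        def projection(p):
            return p[0] * ux + p[1] * uy

        summary.add(min(points, key=projection))
        summary.add(max(points, key=projection))
    return sorted(summary)


def worst_width_ratio(points, summary, num_checks=360):
    """Smallest ratio width(summary) / width(points) over test directions."""
    ratios = []
    for i in range(num_checks):
        angle = math.pi * i / num_checks
        ratios.append(
            directional_width(summary, angle) / directional_width(points, angle)
        )
    return min(ratios)


# Synthetic input: 10,000 points drawn uniformly from a square (seeded).
rng = random.Random(0)
P = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(10_000)]
K = extreme_point_summary(P, num_directions=16)

print("size of P:", len(P))
print("size of summary K:", len(K))
print("worst width ratio:", worst_width_ratio(P, K))
```

Because K is a subset of P, its width never exceeds that of P in any direction; a summary behaves like an ε-kernel when the worst ratio stays at or above 1 − ε. The size of K here is at most twice the number of directions, regardless of how many points P contains, which mirrors the abstract's point that coreset size depends on ε rather than on the input size.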
Similar resources
Classification and properties of acyclic discrete phase-type distributions based on geometric and shifted geometric distributions
Acyclic phase-type distributions form a versatile model, serving as approximations to many probability distributions in various circumstances. They exhibit special properties and characteristics that usually make their applications attractive. Compared to acyclic continuous phase-type (ACPH) distributions, acyclic discrete phase-type (ADPH) distributions and their subclasses (ADPH family) have ...
Determination of critical properties of Alkanes derivatives using multiple linear regression
This study presents some mathematical methods for estimating the critical properties of 40 different types of alkanes and their derivatives including critical temperature, critical pressure and critical volume. This algorithm used QSPR modeling based on graph theory, several structural indices, and geometric descriptors of chemical compounds. Multiple linear regression was used to estimate the ...
M-estimators as GMM for Stable Laws Discretizations
This paper is devoted to "Some Discrete Distributions Generated by Standard Stable Densities" (in short, Discrete Stable Densities). The large-sample properties of M-estimators as obtained by the "Generalized Method of Moments" (GMM) are discussed for such distributions. Some corollaries are proposed. Moreover, using the respective results we demonstrate the large-sample pro...
Bootstrapping Descriptors for Non-Euclidean Data
For data carrying a non-Euclidean geometric structure it is natural to perform statistics via geometric descriptors. Typical candidates are means, geodesics, or more generally, lower dimensional subspaces, which carry specific structure. Asymptotic theory for such descriptors is slowly unfolding and its application to statistical testing usually requires one more step: Assessing the distributio...
Power Normal-Geometric Distribution: Model, Properties and Applications
In this paper, we introduce a new skewed distribution of which the normal and power normal distributions are two special cases. This distribution is obtained by taking the geometric maximum of independent identically distributed power normal random variables. We call this distribution the power normal-geometric distribution. Some mathematical properties of the new distribution are presented. Maximu...